17 research outputs found

    Learning Markov networks with context-specific independences

    Learning the structure of a Markov network from data is a problem that has received considerable attention in machine learning and in many other application fields. This work focuses on a particular approach to this problem called independence-based learning. This approach is guaranteed to learn the correct structure efficiently whenever the data is sufficient to represent the underlying distribution. However, an important limitation of the approach is that the learned structures are encoded in an undirected graph, and graphs cannot encode some types of independence relations, such as context-specific independences (CSIs). A CSI is a particular case of conditional independence that holds only for a certain assignment of its conditioning set, in contrast to conditional independences, which must hold for all assignments. In this work we present CSPC, an independence-based algorithm that learns structures encoding context-specific independences and represents them in a log-linear model instead of a graph. The central idea of CSPC is to combine the theoretical guarantees provided by the independence-based approach with the benefits of representing complex structures using features in a log-linear model. We present experiments on a synthetic case showing that CSPC is more accurate than state-of-the-art independence-based (IB) algorithms when the underlying distribution contains CSIs.
    Comment: 8 pages, 6 figures
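
    The notion of a context-specific independence described in this abstract can be illustrated with a small sketch. It is not the CSPC algorithm and uses no code from the paper; the variables X, Y, Z, the probability values, and the helper names `build_joint` and `independent_in_context` are all made up for illustration. The toy distribution is built so that X and Y are independent in the context Z=0 but dependent in the context Z=1, which a single undirected graph over X, Y, Z cannot express.

```python
from itertools import product


def build_joint():
    """Toy joint P(X, Y, Z) over binary variables (all numbers are illustrative)."""
    p_z = {0: 0.5, 1: 0.5}
    # P(X = x | Z = z), indexed as p_x_given_z[z][x]
    p_x_given_z = {0: {0: 0.3, 1: 0.7}, 1: {0: 0.6, 1: 0.4}}
    # P(Y = 1 | X = x, Z = z), indexed by (x, z).
    # For z = 0 the value does not depend on x; for z = 1 it does.
    p_y1 = {(0, 0): 0.2, (1, 0): 0.2, (0, 1): 0.1, (1, 1): 0.9}

    joint = {}
    for x, y, z in product((0, 1), repeat=3):
        p_y = p_y1[(x, z)] if y == 1 else 1.0 - p_y1[(x, z)]
        joint[(x, y, z)] = p_z[z] * p_x_given_z[z][x] * p_y
    return joint


def independent_in_context(joint, z, tol=1e-9):
    """Check whether P(x, y | Z=z) == P(x | Z=z) * P(y | Z=z) for all x, y."""
    p_context = sum(p for (_, _, zz), p in joint.items() if zz == z)
    for x, y in product((0, 1), repeat=2):
        p_xy = joint[(x, y, z)] / p_context
        p_x = sum(joint[(x, yy, z)] for yy in (0, 1)) / p_context
        p_y = sum(joint[(xx, y, z)] for xx in (0, 1)) / p_context
        if abs(p_xy - p_x * p_y) > tol:
            return False
    return True


if __name__ == "__main__":
    joint = build_joint()
    for z in (0, 1):
        print(f"X independent of Y given Z={z}:", independent_in_context(joint, z))
```

    Because the check works on exact probabilities, it only illustrates the definition; an algorithm that learns from data would instead have to rely on statistical independence tests restricted to the samples matching each context.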

    Learning context-specific independences in Markov random fields

    Undirected models, or Markov random fields, are widely used for problems that learn an unknown distribution from a dataset. This is because they can represent a distribution efficiently by making explicit the conditional independences that may exist among its variables. Beyond these independences, it is possible to represent others, the context-specific independences (CSIs), which, unlike the former, hold only under certain values taken by subsets of the variables. Because of this, they are difficult to represent and to learn from data. In this work we present an approach for representing CSIs in undirected models and an algorithm that learns them from data using statistical tests. We show results in which the models learned by our algorithm are better than or comparable to models learned by other algorithms that do not use CSIs.
    Presented at the XII Workshop Agentes y Sistemas Inteligentes (WASI). Red de Universidades con Carreras en Informática (RedUNCI)

    Male sterility and somatic hybridization in plant breeding

    Plant male sterility refers to the failure in the production of fertile pollen. It occurs spontaneously in natural populations and may be caused by genes encoded in the nuclear (genic male sterility; GMS) or mitochondrial (cytoplasmic male sterility; CMS) genomes. This feature has great agronomic value for the production of hybrid seeds, since it prevents self-pollination without the need for emasculation, which is time-consuming and cost-intensive. CMS has been widely used in crops such as corn, rice, wheat, citrus, and several species of the family Solanaceae. Mitochondrial genes determining CMS have been uncovered in a wide range of plant species. The modes of action of CMS have been classified in terms of the effect they produce in the cell, which ultimately leads to a failure in the production of fertile pollen. Male fertility can be restored by nuclear-encoded genes, termed restorer-of-fertility (Rf) factors. CMS from wild plants has been transferred to species of agronomic interest through somatic hybridization. Somatic hybrids have also been produced to generate CMS de novo upon recombination of the mitochondrial genomes of two parental plants or by separating the CMS cytoplasm from the nuclear Rf alleles. As a result, somatic hybridization can be used as a highly efficient and useful strategy to incorporate CMS in breeding programs.
    Fil: Garcia, Laura Evangelina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Mendoza. Instituto de Biología Agrícola de Mendoza. Universidad Nacional de Cuyo. Facultad de Ciencias Agrarias. Instituto de Biología Agrícola de Mendoza; Argentina. Universidad Nacional de Cuyo; Argentina
    Fil: Edera, Alejandro. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Mendoza. Instituto de Biología Agrícola de Mendoza. Universidad Nacional de Cuyo. Facultad de Ciencias Agrarias. Instituto de Biología Agrícola de Mendoza; Argentina
    Fil: Marfil, Carlos Federico. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Mendoza. Instituto de Biología Agrícola de Mendoza. Universidad Nacional de Cuyo. Facultad de Ciencias Agrarias. Instituto de Biología Agrícola de Mendoza; Argentina
    Fil: Sánchez Puerta, María Virginia. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Mendoza. Instituto de Biología Agrícola de Mendoza. Universidad Nacional de Cuyo. Facultad de Ciencias Agrarias. Instituto de Biología Agrícola de Mendoza; Argentina

    The IBMAP approach for Markov networks structure learning

    In this work we consider the problem of learning the structure of Markov networks from data. We present an approach for tackling this problem called IBMAP, together with an efficient instantiation of the approach: the IBMAP-HC algorithm, designed to avoid important limitations of existing independence-based algorithms. These algorithms proceed by performing statistical independence tests on data, trusting the outcome of each test completely. In practice, tests may be incorrect, resulting in potential cascading errors and a consequent reduction in the quality of the learned structures. IBMAP accounts for this uncertainty in the outcome of the tests through a probabilistic maximum-a-posteriori approach. The approach is instantiated in the IBMAP-HC algorithm, a structure selection strategy that performs a polynomial heuristic local search in the space of possible structures. We present an extensive empirical evaluation on synthetic and real data, showing that our algorithm significantly outperforms current independence-based algorithms in terms of data efficiency and quality of learned structures, with equivalent computational complexity. We also show the performance of IBMAP-HC in a real-world application of knowledge discovery: EDAs, which are evolutionary algorithms that use structure learning at each generation to model the distribution of populations. The experiments show that when IBMAP-HC is used to learn the structure, EDAs improve their convergence to the optimum.

    Anc2vec: Embedding gene ontology terms by preserving ancestors relationships

    The gene ontology (GO) provides a hierarchical structure with a controlled vocabulary composed of terms describing functions and localization of gene products. Recent works propose vector representations, also known as embeddings, of GO terms that capture meaningful information about them. Significant performance improvements have been observed when these representations are used on diverse downstream tasks, such as the measurement of semantic similarity between GO terms and functional similarity between proteins. Despite the success shown by these approaches, existing embeddings of GO terms still fail to capture crucial structural features of the GO. Here, we present anc2vec, a novel protocol based on neural networks for constructing vector representations of GO terms by preserving three important ontological features: ontological uniqueness, ancestor hierarchy, and sub-ontology membership. The advantages of using anc2vec are demonstrated by systematic experiments on diverse tasks: visualization, sub-ontology prediction, inference of structurally related terms, retrieval of terms from aggregated embeddings, and prediction of protein-protein interactions. In these tasks, experimental results show that the performance of anc2vec representations is better than that of recent approaches. This demonstrates that embeddings can achieve higher performance on diverse tasks when the structure of the GO is better represented. Full source code and data are available at https://github.com/sinc-lab/anc2vec.
    Fil: Edera, Alejandro. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Santa Fe. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional. Universidad Nacional del Litoral. Facultad de Ingeniería y Ciencias Hídricas. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional; Argentina
    Fil: Milone, Diego Humberto. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Santa Fe. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional. Universidad Nacional del Litoral. Facultad de Ingeniería y Ciencias Hídricas. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional; Argentina
    Fil: Stegmayer, Georgina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Santa Fe. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional. Universidad Nacional del Litoral. Facultad de Ingeniería y Ciencias Hídricas. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional; Argentina

    Simultaneous Profiling of Chromatin Accessibility and DNA Methylation in Complete Plant Genomes Using Long-Read Sequencing

    Epigenetic regulations, including chromatin accessibility, nucleosome positioning, and DNA methylation, intricately shape genome function. However, current chromatin profiling techniques relying on short-read sequencing technologies face limitations in adequately characterising repetitive genomic regions and detecting multiple chromatin features simultaneously. Here, we present Simultaneous Accessibility and DNA Methylation Sequencing (SAM-seq), a robust method leveraging bacterial adenine methyltransferases (m6A-MTases) to label accessible regions in purified plant nuclei. Coupled with Oxford Nanopore Technology sequencing, SAM-seq enables high-resolution profiling of m6A-tagged chromatin accessibility together with cytosine methylation along chromatin fibres in plants. Analysis of naked genomic DNA revealed significant sequence preference biases of m6A-MTases, controllable through a normalisation step. By applying SAM-seq to Arabidopsis and maize nuclei we obtained fine-grained accessibility and DNA methylation landscapes at genome-wide and local scales. We characterised crosstalk between chromatin accessibility and DNA methylation, notably within nucleosomes of genes, TEs, and centromeric repeats. SAM-seq also facilitated the identification of DNA footprints over cis-regulatory regions. Furthermore, using the single-molecule information provided by SAM-seq we unveiled extensive cellular heterogeneity at chromatin domains harbouring antagonistic chromatin marks, suggesting that bivalency reflects cell-specific regulations of gene activity. In summary, we introduce a robust method for acquiring high-resolution accessibility and DNA methylation landscapes across entire plant genomes. Our results underscore the importance of considering the intrinsic substrate preferences of m6A-MTases for reliable chromatin profiling. SAM-seq opens new opportunities to simultaneously study multiple epigenetic features at unprecedented scale, enabling the investigation of non-model species with limited genomic and epigenomic information.

    Transfer learning in proteins: evaluating novel protein learned representations for bioinformatics tasks

    A representation method is an algorithm that calculates numerical feature vectors for samples in a dataset. Such vectors, also known as embeddings, define a relatively low-dimensional space able to efficiently encode high-dimensional data. Very recently, many types of learned data representations based on machine learning have appeared and are being applied to several tasks in bioinformatics. In particular, protein representation learning methods integrate different types of protein information (sequence, domains, etc.), in supervised or unsupervised learning approaches, and provide embeddings of protein sequences that can be used for downstream tasks. One task of special interest is the automatic function prediction of the huge number of novel proteins that are being discovered nowadays and are still totally uncharacterized. However, despite its importance, to date there is no fair benchmark study of the predictive performance of existing proposals on the same large set of proteins and for very concrete and common bioinformatics tasks. This lack of benchmark studies prevents the community from using adequate predictive methods for accelerating the functional characterization of proteins. In this study, we performed a detailed comparison of protein sequence representation learning methods, explaining each approach and comparing them with an experimental benchmark on several bioinformatics tasks: (i) determining protein sequence similarity in the embedding space; (ii) inferring protein domains; and (iii) predicting ontology-based protein functions. We examine the advantages and disadvantages of each representation approach in light of the benchmark results. We hope the results and the discussion of this study can help the community to select the most adequate machine learning-based technique for protein representation according to the bioinformatics task at hand.
    Fil: Fenoy, Luis Emilio. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Santa Fe. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional. Universidad Nacional del Litoral. Facultad de Ingeniería y Ciencias Hídricas. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional; Argentina
    Fil: Edera, Alejandro. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Santa Fe. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional. Universidad Nacional del Litoral. Facultad de Ingeniería y Ciencias Hídricas. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional; Argentina
    Fil: Stegmayer, Georgina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Santa Fe. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional. Universidad Nacional del Litoral. Facultad de Ingeniería y Ciencias Hídricas. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional; Argentina

    Learning Markov Network Structures Constrained by Context-Specific Independences

    This work focuses on learning the structure of Markov networks from data. Markov networks are parametric models for compactly representing complex probability distributions. These models are composed of a structure and numerical weights, where the structure describes independences that hold in the distribution. Depending on the goal of structure learning, learning algorithms can be divided into density estimation algorithms, where the structure is learned for answering inference queries, and knowledge discovery algorithms, where the structure is learned for describing independences qualitatively. The latter algorithms have an important limitation for describing independences because they use a single graph, a coarse-grained structure representation that cannot represent flexible independences. For instance, context-specific independences cannot be described by a single graph. To overcome this limitation, this work proposes an alternative representation named the canonical model, as well as the CSPC algorithm, a novel knowledge discovery algorithm for learning canonical models by using context-specific independences as constraints. In an extensive empirical evaluation, CSPC learns more accurate structures than state-of-the-art density estimation and knowledge discovery algorithms. Moreover, for answering inference queries, our approach obtains competitive results against density estimation algorithms, significantly outperforming knowledge discovery algorithms.
    Fil: Edera, Alejandro. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Mendoza; Argentina. Universidad Tecnológica Nacional. Facultad Regional Mendoza. Departamento de Sistemas de Información; Argentina
    Fil: Schluter, Federico Enrique Adolfo. Universidad Tecnológica Nacional. Facultad Regional Mendoza. Departamento de Sistemas de Información; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Mendoza; Argentina
    Fil: Bromberg, Facundo. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Mendoza; Argentina. Universidad Tecnológica Nacional. Facultad Regional Mendoza. Departamento de Sistemas de Información; Argentina

    Towards a comprehensive picture of C-to-U RNA editing sites in angiosperm mitochondria

    Key message: Our understanding of the dynamics and evolution of RNA editing in angiosperms is in part limited by the few editing sites identified to date. This study identified 10,217 editing sites from 17 diverse angiosperms. Our analyses confirmed the universality of certain features of RNA editing, and offer new evidence on the loss of editing sites in angiosperms.
    Abstract: RNA editing is a post-transcriptional process that converts cytidines (C) to uridines (U) in organellar transcripts of angiosperms. These substitutions mostly take place in mitochondrial messenger RNAs at specific positions called editing sites. By means of publicly available RNA-seq data, this study identified 10,217 editing sites in mitochondrial protein-coding genes of 17 diverse angiosperms. Even though other types of mismatches were also identified, we did not find evidence of non-canonical editing processes. The results showed an uneven distribution of editing sites among species, genes, and codon positions. The analyses revealed that editing sites were conserved across angiosperms, but there were some species-specific sites. Non-synonymous editing sites were particularly highly conserved (~80%) across the plant species and were efficiently edited (80% editing extent). In contrast, editing sites at third codon positions were poorly conserved (~30%) and only partially edited (~40% editing extent). We found that the loss of editing sites along angiosperm evolution occurs mainly through the replacement of editing sites with thymidines, rather than through degradation of the editing recognition motif around editing sites. Consecutive and highly conserved editing sites had been replaced by thymidines as a result of retroprocessing, by which edited transcripts are reverse transcribed to cDNA and then integrated into the genome by homologous recombination. This phenomenon was more pronounced in eudicots and in the gene cox1. These results suggest that retroprocessing is a widespread driving force underlying the loss of editing sites in angiosperm mitochondria.
    Fil: Edera, Alejandro. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Mendoza. Instituto de Biología Agrícola de Mendoza. Universidad Nacional de Cuyo. Facultad de Ciencias Agrarias. Instituto de Biología Agrícola de Mendoza; Argentina
    Fil: Gandini, Carolina Lia. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Mendoza. Instituto de Biología Agrícola de Mendoza. Universidad Nacional de Cuyo. Facultad de Ciencias Agrarias. Instituto de Biología Agrícola de Mendoza; Argentina
    Fil: Sánchez Puerta, María Virginia. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Mendoza. Instituto de Biología Agrícola de Mendoza. Universidad Nacional de Cuyo. Facultad de Ciencias Agrarias. Instituto de Biología Agrícola de Mendoza; Argentina